Mathematical Modelling of Generalization
Abstract
This paper surveys certain developments in the use of probabilistic techniques for the modelling of generalization. Some of the main methods and key results are discussed. Many details are omitted, the aim being to give a high-level overview of the types of approaches taken and methods used.

1 Probabilistic Modelling of Learning

Suppose that $X$ is a set of examples and that $Y \subseteq [0, 1]$ is a set of possible outputs. Elements $(x, y)$ of $Z = X \times Y$ will be called labelled examples. In the model, we shall assume that a learning algorithm $A$ takes a randomly generated training sample of labelled examples and produces a function $h : X \to [0, 1]$, chosen from some hypothesis class $H$ of functions. We assume that there is some fixed, but unknown, probability measure $\mu$ on $Z$, and that each training example is generated independently according to $\mu$. A learning algorithm is a function $A : \bigcup_{n=1}^{\infty} Z^n \to H$, where $H$ is a hypothesis class of functions from $X$ to $[0, 1]$.

We have in mind some loss function $\ell : [0, 1] \times Y \to [0, 1]$. Examples of loss functions are $\ell(r, s) = |r - s|$, $\ell(r, s) = (r - s)^2$, and the discrete loss, given by $\ell(r, s) = 0$ if $r = s$ and $\ell(r, s) = 1$ if $r \neq s$. What we hope for is that, for a training sample $z \in Z^n$, the hypothesis $A(z)$ has a relatively small loss, where, for $h \in H$, the loss of $h$ is the expectation $L(h) = \mathbb{E}\, \ell(h(x), y)$ (where the expectation is with respect to $\mu$). Since the best loss one could hope to be near is $L^* = \inf_{h \in H} L(h)$, we want $A(z)$ to have loss close to $L^*$, with high probability, provided the sample size $n$ is large enough. (Here, and in the rest of the paper, we use the symbol $P$ to denote probability. In the definition that follows, the probability is with respect to $\mu$.)

We say that $A$ is a successful learning algorithm for $H$ if for all $\epsilon, \delta \in (0, 1)$, there is some $n_0(\epsilon, \delta)$ (depending on $\epsilon$ and $\delta$ only) such that, if $n > n_0(\epsilon, \delta)$, then with probability at least $1 - \delta$, $L(A(z)) \le L^* + \epsilon$. This definition has its origins in [35, 33, 32, 19]. (See also the books [3, 4, 21, 36].)
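As a rough illustration of the definitions above, the following Python sketch implements the three example loss functions, the sample average that estimates $L(h)$, and one natural candidate for $A$: picking the hypothesis with the smallest average loss on the sample. The text does not prescribe any particular algorithm, so empirical loss minimisation is a choice made here. The function names, the finite class of threshold hypotheses, and the toy distribution (uniform $x$, label 1 exactly when $x \ge 0.3$) are also assumptions made for the example, not part of the paper.

```python
import random

def absolute_loss(r, s):
    """l(r, s) = |r - s|."""
    return abs(r - s)

def squared_loss(r, s):
    """l(r, s) = (r - s)^2."""
    return (r - s) ** 2

def discrete_loss(r, s):
    """l(r, s) = 0 if r = s, and 1 otherwise."""
    return 0.0 if r == s else 1.0

def empirical_loss(h, sample, loss):
    """Average of loss(h(x), y) over the sample: a sample estimate of L(h)."""
    return sum(loss(h(x), y) for x, y in sample) / len(sample)

def make_threshold(t):
    """Hypothetical binary hypothesis: x -> 1 if x >= t, else 0."""
    return lambda x: 1 if x >= t else 0

def draw_sample(n, seed=0):
    """Assumed toy distribution: x uniform on [0, 1), label 1 iff x >= 0.3."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    return [(x, 1 if x >= 0.3 else 0) for x in xs]

def erm(hypotheses, sample, loss):
    """One candidate for A: return a hypothesis of least empirical loss."""
    return min(hypotheses, key=lambda h: empirical_loss(h, sample, loss))

if __name__ == "__main__":
    sample = draw_sample(200)
    H = [make_threshold(k / 10) for k in range(11)]  # finite class, an assumption
    best = erm(H, sample, discrete_loss)
    print(empirical_loss(best, sample, discrete_loss))
```

In this toy setting the labelling rule is itself a member of the class, so $L^* = 0$ and the selected hypothesis has zero empirical loss. How close its true loss is to $L^*$ is the question the definition of a successful learning algorithm addresses.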
Note that if $A$ is successful, then there is some function $\epsilon_0(n, \delta)$ of $n$ and $\delta$, with the property that for all $\delta$, $\lim_{n \to \infty} \epsilon_0(n, \delta) = 0$, and such that for any probability measure $\mu$ on $Z$, with probability at least $1 - \delta$ we have $L(A(z)) \le L^* + \epsilon_0(n, \delta)$. The minimal $\epsilon_0(n, \delta)$ is called the estimation error of the algorithm.

(Footnote 1: Certain measurability conditions are implicitly assumed in what follows, but these conditions are reasonable and not particularly stringent. Details may be found in [31], for instance.)

When $H$ is a set of binary functions, meaning each function in $H$ maps into $\{0, 1\}$, if $Y = \{0, 1\}$, and if we use the discrete loss function, then we shall say that we have a binary learning problem.

We might want to use real functions for classification. Here, we would have $Y = \{0, 1\}$, but the functions in $H$ would map $X$ into $[0, 1]$. In this case, one appropriate loss function would be given, for $r \in [0, 1]$ and $s \in \{0, 1\}$, by $\ell(r, s) = 0$ if $r - 1/2$ and $s - 1/2$ have the same sign, and $\ell(r, s) = 1$ otherwise. We call this the threshold loss. Thus, with respect to the threshold loss, $\ell(h(x), y) \in \{0, 1\}$ is 0 precisely when the thresholded function $T_h : x \mapsto \operatorname{sign}(h(x) - 1/2)$ has value $y$.

There is some advantage in considering the margin of classification by these real-valued hypotheses (a fact that has been emphasised for some time in pattern recognition and learning [34], and which is very important in Support Vector Machines [13]). Explicitly, suppose that $\gamma > 0$, and for $r \in [0, 1]$, define $\operatorname{mar}(r, 1) = r - 1/2$ and $\operatorname{mar}(r, 0) = 1/2 - r$. The margin of $h \in H$ on $z = (x, y) \in X \times \{0, 1\}$ is defined to be $\operatorname{mar}(h(x), y)$. Now, define the loss function $\ell$ by $\ell(r, s) = 1$ if $\operatorname{mar}(r, s) < \gamma$ and $\ell(r, s) = 0$ if $\operatorname{mar}(r, s) \ge \gamma$. The corresponding loss $L(h)$ of a hypothesis is called the loss of $h$ at margin $\gamma$.
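The threshold loss and the loss at margin $\gamma$ can be compared directly in a few lines of Python. This is a minimal sketch: the function names are chosen here, and treating the boundary case $r = 1/2$ (where $r - 1/2$ has no sign) as an error is an assumption, since the text does not say how that case is handled.

```python
def margin(r, s):
    """mar(r, 1) = r - 1/2 and mar(r, 0) = 1/2 - r, for r in [0, 1], s in {0, 1}."""
    return r - 0.5 if s == 1 else 0.5 - r

def threshold_loss(r, s):
    """0 if r - 1/2 and s - 1/2 have the same sign, 1 otherwise.

    Assumption: r = 1/2 is counted as an error, because its sign is undefined.
    """
    return 0 if (r - 0.5) * (s - 0.5) > 0 else 1

def margin_loss(r, s, gamma):
    """1 if mar(r, s) < gamma, and 0 if mar(r, s) >= gamma."""
    return 1 if margin(r, s) < gamma else 0

if __name__ == "__main__":
    # An output of 0.6 on a positive example is on the correct side of 1/2,
    # but only by about 0.1, so it is penalised at margin 0.2.
    print(threshold_loss(0.6, 1), margin_loss(0.6, 1, 0.2), margin_loss(0.6, 1, 0.05))
```

With this convention, for any $\gamma > 0$ the loss at margin $\gamma$ is never smaller than the threshold loss: an output on the wrong side of $1/2$ has non-positive margin and is therefore also penalised at margin $\gamma$.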
We say (as in [3]) that $A : (0, 1) \times \bigcup_{n=1}^{\infty} Z^n \to H$ is a successful real-valued classification algorithm if for all $\epsilon, \delta \in (0, 1)$ there is $n_0(\epsilon, \delta)$ such that, if $n > n_0(\epsilon, \delta)$, then with probability at least $1 - \delta$, $L(A(\gamma, z)) \le \inf_{h \in H} L(h) + \epsilon$.
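To make the signature $A(\gamma, z)$ concrete, here is a hypothetical algorithm of that shape: it takes a margin parameter and a sample and returns a hypothesis from a finite class that minimises the empirical loss at margin $\gamma$. This is only a sketch under stated assumptions. The class of clipped ramp functions, the function names, and the hand-written sample are all invented for illustration, and nothing here shows that such a procedure is successful in the sense just defined.

```python
def margin(r, s):
    """mar(r, 1) = r - 1/2 and mar(r, 0) = 1/2 - r."""
    return r - 0.5 if s == 1 else 0.5 - r

def ramp(t):
    """Hypothetical real-valued hypothesis: x -> 1/2 + (x - t), clipped to [0, 1]."""
    return lambda x: min(1.0, max(0.0, 0.5 + (x - t)))

def empirical_margin_loss(h, sample, gamma):
    """Fraction of sample points (x, y) on which h has margin below gamma."""
    return sum(1 for x, y in sample if margin(h(x), y) < gamma) / len(sample)

def learn(gamma, sample, ramp_offsets):
    """Sketch of A(gamma, z): offset t of a ramp with least empirical margin loss."""
    return min(ramp_offsets,
               key=lambda t: empirical_margin_loss(ramp(t), sample, gamma))

if __name__ == "__main__":
    # Hand-written toy sample of labelled examples (x, y), with y in {0, 1}.
    z = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
    print(learn(0.25, z, [0.0, 0.5, 1.0]))
```

On this sample the ramp with offset 0.5 classifies every point with margin of at least roughly 0.3, so it is selected for $\gamma = 0.25$; the other two ramps each fall below the margin on half of the points.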